Abstract: Graphics Processing Units (GPUs) consisting of
Streaming Multiprocessors (SMs) achieve high throughput by
running a large number of threads and context switching
among them to hide execution latencies. The number of thread
blocks, and hence the number of threads that can be launched
on an SM, depends on the resource usage (e.g., number of
registers, amount of shared memory) of the thread blocks.
Since threads are allocated to an SM at thread block
granularity, some of the resources may not be used up
completely and hence will be wasted.

We propose an approach that shares the resources of an SM to
utilize the wasted resources by launching more thread blocks.
We show the effectiveness of our approach for two resources:
registers and scratchpad (shared memory). We further propose
optimizations to hide long execution latencies, thus reducing
the number of stall cycles. We implemented our approach in
the GPGPU-Sim simulator and experimentally validated it on
several applications from four benchmark suites: GPGPU-Sim,
Rodinia, CUDA-SDK, and Parboil. We observed that with register
sharing, applications show a maximum improvement of 24% and
an average improvement of 11%. With scratchpad sharing, we
observed a maximum improvement of 30% and an average
improvement of 12.5%.

Keywords: Register Sharing; Scratchpad Sharing; Warp
Scheduling; Thread Level Parallelism

I. Introduction 

Graphics Processing Units (GPUs) have been effectively
used to accelerate large data parallel applications. GPUs
consisting of Streaming Multiprocessors (SMs) achieve high
throughput by concurrently executing a large number of
threads to hide long latencies. The throughput achieved by
a GPU depends on the amount of thread level parallelism
(TLP) utilized by it. Recent studies [1], [2], [3], [4] focus
on improving the throughput of GPUs by exploiting the TLP.

The amount of TLP utilized by a GPU depends on the
number of threads resident on it. When an application is
launched on a GPU, an execution configuration consisting
of the number of thread blocks and the number of threads in
a thread block is specified. The number of thread blocks that
can actually be launched on an SM depends on the resource
requirement, such as the number of registers and the amount
of shared memory needed by each thread block. If an SM
contains $R$ units of a resource and a thread block requires
$R_{tb}$ units to complete its execution, then the SM can
launch at most $\lfloor R/R_{tb} \rfloor$ thread blocks,
utilizing $R_{tb} \times \lfloor R/R_{tb} \rfloor$ units. The
remaining $R \bmod R_{tb}$ units are wasted.

In this paper we propose a mechanism for sharing the
resources of an SM in order to launch more thread blocks,
which reduces the wastage of resources. In particular, we show
how sharing of registers and sharing of scratchpad memory
improve the throughput of SMs. It has been observed [1] that
increasing the number of threads benefits compute bound
applications, but may result in increased L1/L2 cache misses
for memory bound applications, thereby decreasing their
performance. To overcome this, we propose an optimization
called Owner Warp First (OWF) that schedules the extra thread
blocks and their constituent warps effectively. For the
register sharing approach, we further propose two
optimizations, viz., Unrolling and Reordering of Register
Declarations, and Dynamic Warp Execution, which improve
register utilization and minimize the number of stall cycles
observed by the additional thread blocks, respectively.

A. Motivation 

The problem of resource underutilization occurs in GPUs
because resources are allocated at thread block granularity.
We analyzed several benchmark applications using the
GPGPU-Sim [5] simulator^1. For applications that are limited
by registers, we show the number of resident thread
blocks per SM in Figure 1(a), and the percentage of
registers that are unutilized per SM in Figure 1(b). Consider
the application hotspot. Each thread of this benchmark
needs 36 registers, and there are 256 threads in each block,
so the number of registers required per thread block is 9216
(36 * 256). According to the configuration (Table I), the
number of registers available on an SM is 32768, so an SM
can fit only 3 thread blocks ($\lfloor 32768/9216 \rfloor = 3$).
As a result, 5120 registers per SM are wasted.

Similarly, in Figure 1(c) we show the number of resident
thread blocks per SM for the applications that are limited
by scratchpad memory, and in Figure 1(d) we show the
percentage of scratchpad memory that remains unutilized
per SM. Consider the application lavaMD. Each thread block
of this benchmark needs 7200 bytes of scratchpad memory.
According to the configuration in Table I, the amount of
scratchpad memory available per SM is 16384 bytes, hence
an SM can fit only 2 thread blocks. This results in 1984
bytes of memory per SM remaining unutilized. Similar
behavior is observed for other applications as well.
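The arithmetic in the two examples above can be checked with a short script. This is a minimal sketch: the function name `block_residency` is ours, and it considers a single resource only, ignoring the per-SM limits on threads and thread blocks.

```python
def block_residency(total_units, units_per_block):
    """Return (resident thread blocks, wasted units) for one SM resource."""
    blocks = total_units // units_per_block   # floor(R / R_tb)
    wasted = total_units % units_per_block    # R mod R_tb
    return blocks, wasted


# hotspot: 36 registers/thread * 256 threads/block, 32768 registers per SM
print(block_residency(32768, 36 * 256))   # (3, 5120)
# lavaMD: 7200 bytes of scratchpad per block, 16384 bytes per SM
print(block_residency(16384, 7200))       # (2, 1984)
```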

Applications that are constrained by their resource
requirements may not only have low residency, but also waste
resources of the GPU. Our approach uses sharing to reduce
the amount of unutilized resources in order to increase the
number of resident thread blocks. Our experiments show
that these extra thread blocks help to hide long execution
latencies and increase throughput.

^1 The GPU configuration is described in Table I (Section VI-A). The
benchmark details are given in Table II and Table III (Section VI-A).

Figure 1: (a) Number of resident thread blocks with limited registers (b) Register underutilization
(c) Number of resident thread blocks with limited scratchpad memory (d) Scratchpad underutilization

B. Contributions 

We make the following contributions in this work:

1. To utilize the resources of GPUs effectively, we propose a
novel resource sharing mechanism that enables launching
more thread blocks per SM.

2. We implemented our approach for two resources, i.e.,
registers and scratchpad memory. We propose optimizations
to further improve the throughput of applications.

3. We implemented our approach using the GPGPU-Sim
simulator and evaluated it on several applications from the
GPGPU-Sim, Rodinia, CUDA-SDK, and Parboil benchmark
suites. We observed an average improvement of 11% with
register sharing and an average improvement of 12.5%
with scratchpad sharing.

The paper is organized as follows: Section II describes
the background required for our approach. Our approach is
presented in Sections III and IV. Section V discusses the
hardware overhead for implementing our approach. Section VI
describes the experimental evaluation. Section VII discusses
related work and Section VIII concludes the paper.

II. Background 

A typical NVIDIA GPU consists of a set of Streaming
Multiprocessors (SMs), and each multiprocessor has
execution units called Stream Processors (SPs). CUDA
supports extensions to languages, such as C, to allow
programmers to define and invoke parallel functions, called
kernels, on a GPU. A kernel is invoked along with an
execution configuration of threads that specifies the number
of threads per thread block and the number of thread blocks.

The number of thread blocks that can reside on an SM
depends on: (a) the number of registers used by a thread
block and the number of registers available in the SM, (b)
the amount of scratchpad memory used by a thread block
and the amount of scratchpad memory available in the SM,
(c) the maximum number of threads allowed per SM, and
(d) the maximum number of thread blocks allowed per SM. The
threads in a thread block are further divided into sets of
32 consecutive threads called warps. Each SM contains one
or more warp schedulers, which schedule a ready warp every
cycle from a pool of ready warps. All threads in a warp
execute the same instruction.

Warp schedulers schedule instructions in order, so
when the current instruction of a warp cannot be issued,
the warp is not considered to be ready. If no warp can
be scheduled in a cycle, then that cycle is a stall cycle. As
the number of stall cycles increases, the run time goes up and
the throughput decreases. Our approach increases the number
of resident thread blocks, and hence the number of resident
warps, by utilizing the wasted registers and scratchpad
memory on each SM. It also improves warp scheduling to hide
long latencies.

III. Resource Sharing 

We can increase the number of thread blocks in an
SM by allowing two thread blocks to share resources. For
example, consider an application that has thread blocks of
size 10 warps (320 threads), where a thread block requires 10K
resource units to complete its execution. If an SM has 35K
resource units, at most 3 thread blocks can be resident on
each SM by utilizing 30K resource units; the remaining 5K
units are wasted. The schematic of this approach (baseline)
is shown in Figure 2(a), where thread blocks TB0, TB1, and
TB2 are scheduled on an SM.

In order to reduce the wastage of resources, our approach
allocates one more thread block (TB3) in sharing mode with
TB2. Instead of allocating 10K resource units separately to
each of the thread blocks TB2 and TB3, a total of 15K units
is allocated for the two blocks as follows: each of TB2 and
TB3 is allocated 5K units exclusively (Private or Unshared
Resource), while the remaining 5K units (Shared Resource)
are all allocated to whichever of TB2 or TB3 first needs any
of them. The thread block that did not get ownership of the
shared resources waits, when it needs any of the shared
resources, until the other block finishes.

We now describe in detail our approach for two types of
resources: (a) register sharing and (b) scratchpad sharing.

A. Register Sharing 

The scenario in Figure 2(a) can be improved using our
register allocation scheme shown in Figure 2(b), in which
we allocate 10K registers to each of the thread blocks TB0
and TB1. The remaining 15K registers are shared between thread
blocks TB2 and TB3 such that each pair of warps in
these thread blocks is allocated 1.5K registers, as described
next. We refer to TB0 and TB1 as unshared thread blocks,
and to TB2 and TB3 as shared thread blocks.

Figure 2: Approaches to Resource Allocation (a) Default Approach (b) Register Sharing (c) Scratch Pad Memory Sharing

Consider the pair of warps W20 and W30 that participate
in sharing. Our approach allocates 0.5K registers each to
W20 and W30 exclusively (private or unshared registers).
The remaining 0.5K registers are shared registers, which are
allocated to these warps together in a shared but exclusive
manner, i.e., only one of them can access the pool of shared
registers at a time. For example, if warp W20 accesses any
of the shared registers first, exclusive access to all the 0.5K
shared registers is given to W20, while W30 is prevented
from accessing any of those 0.5K shared registers until W20
finishes. This implies that W30 can continue its execution
until its first access to any of the 0.5K shared registers, at
which point it busy-waits. Only after W20 finishes execution
can W30 access the shared registers and continue. This way,
the additional warps make some progress, which helps in hiding
execution latencies.

To generalize this idea and to compute the increase in the
number of thread blocks, consider a GPU that
provides $R$ registers per SM. Also, consider a thread block
that requires $R_{tb}$ registers, where each warp in the thread
block requires $R_w$ registers to complete its execution. To
increase the number of thread blocks that share registers with
other existing thread blocks in the SM, we allocate
$R_{tb}(1 + t)$ registers (for any threshold $0 < t < 1$) to
each pair of shared thread blocks, instead of allocating
$2R_{tb}$ registers to them (in Figure 2(b), $t$ is 0.5).
Equivalently, we allocate $R_w(1 + t)$ registers per two warps
from these thread blocks (i.e., one warp from each shared
thread block in the pair), such that each of these warps can
access $R_w t$ unshared registers independently, and they can
access the remaining $R_w(1 - t)$ shared registers only when
granted access.
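The split described above can be written out as a small helper. This is an illustrative sketch only; `shared_allocation` is a hypothetical name, and "10K" is treated as 10,000 units for readability.

```python
def shared_allocation(units_needed, t):
    """Split the allocation for one shared pair (two blocks or two warps).

    units_needed is R_tb for a pair of thread blocks or R_w for a pair of warps.
    """
    if not 0 < t < 1:
        raise ValueError("threshold t must satisfy 0 < t < 1")
    private_each = units_needed * t                 # owned by each partner
    shared_pool = units_needed * (1 - t)            # one owner at a time
    pair_total = 2 * private_each + shared_pool     # equals units_needed * (1 + t)
    return private_each, shared_pool, pair_total


# Thread block pair of Figure 2(b): R_tb = 10K, t = 0.5
print(shared_allocation(10_000, 0.5))   # (5000.0, 5000.0, 15000.0)
# Warp pair: R_w = 1K, t = 0.5
print(shared_allocation(1_000, 0.5))    # (500.0, 500.0, 1500.0)
```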

To detect whether a register accessed by a warp is shared or
unshared, and to efficiently access it from the register file
unit, we modify the existing register file access mechanism
as shown in Figure 3. When a warp (Warpid) needs to access
a register (RegNo), we first check if the warp is an unshared
warp, i.e., if it belongs to an unshared thread block (Figure 3,
Step (b)). If it is an unshared warp, it can directly access the
register from the register file using a combination of (Warpid,
RegNo). If Warpid is a shared warp, the accessed register
is an unshared register if RegNo $< R_w t$ (Step (c)). This
is because $R_w t$ unshared registers are allocated
to each warp. If RegNo $\ge R_w t$, we treat the register as
a shared register. A warp can access an unshared register
directly from the register file, but it can access a shared
register only when it gets exclusive access by acquiring a
lock (Step (e)); otherwise it retries the access in another
cycle.

Figure 3: Register Access Mechanism

Figure 4: Scratchpad Access Mechanism
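The access check can be sketched in software as follows. This is a simplified model, not the simulator code: the names `SharedRegisterPool` and `can_access` are hypothetical, the step labels follow the text above, and the deadlock-avoidance rule discussed next is deliberately left out.

```python
class SharedRegisterPool:
    """Lock state for the registers shared by one pair of warps."""

    def __init__(self):
        self.owner = None   # id of the warp holding the lock, or None


def can_access(warp_id, reg_no, is_shared_warp, unshared_regs, pool):
    """Return True if the access may proceed this cycle, False to retry later."""
    if not is_shared_warp:
        return True                  # Step (b): unshared warp, direct access
    if reg_no < unshared_regs:
        return True                  # Step (c): private register of a shared warp
    if pool.owner is None:
        pool.owner = warp_id         # Step (e): first requester takes the lock
    return pool.owner == warp_id     # only the lock holder may proceed


pool = SharedRegisterPool()
print(can_access(20, 9, True, 8, pool))   # True: warp 20 acquires the lock
print(can_access(30, 9, True, 8, pool))   # False: warp 30 must retry
print(can_access(30, 2, True, 8, pool))   # True: register 2 is private
```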

Consider the scenario shown in Figure 5, where two thread
blocks TB1 and TB2 are in shared mode. Assume that warp
W1 of TB1 tries to acquire a lock to access shared registers,
but W3 has already acquired the lock. Also, assume that the
warps W2 and W3 are waiting for warps W1 and W4, respectively,
to arrive at a barrier instruction (__syncthreads()). Now,
if warp W4 tries to acquire a lock to access the registers it
shares with W2, and W2 already holds that lock, then a
deadlock occurs. To avoid deadlock, we always ensure that if
thread blocks TB1 and TB2 share registers, then a warp from
TB1 can acquire a lock only when either (a) none of the warps
from TB2 have acquired a lock for the shared registers, or (b)
the warps from TB2 that have acquired exclusive access to
the shared registers have finished their execution. In the
example, since warp W2 has already acquired a lock, W3
cannot acquire a lock, which avoids the deadlock.

Figure 5: Deadlock in the presence of barrier instructions
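The lock rule can be modeled at thread block granularity as below. This is a sketch under our own naming (`SharedPair`, `try_acquire`, `warp_finished` are hypothetical); it tracks only which block's warps currently hold locks and does not model the per-warp-pair register pools.

```python
class SharedPair:
    """Lock bookkeeping for two thread blocks that share registers."""

    def __init__(self, block_a, block_b):
        self.partner = {block_a: block_b, block_b: block_a}
        # Warps of each block that hold a shared-register lock and are still running.
        self.holders = {block_a: set(), block_b: set()}

    def try_acquire(self, block, warp):
        # Allowed only if no unfinished warp of the partner block holds a lock.
        if self.holders[self.partner[block]]:
            return False
        self.holders[block].add(warp)
        return True

    def warp_finished(self, block, warp):
        self.holders[block].discard(warp)


pair = SharedPair("TB1", "TB2")
print(pair.try_acquire("TB1", "W2"))   # True: no warp of TB2 holds a lock
print(pair.try_acquire("TB2", "W3"))   # False: W2 of TB1 holds a lock
pair.warp_finished("TB1", "W2")
print(pair.try_acquire("TB2", "W3"))   # True: the TB1 holder has finished
```

Because all lock holders belong to the same block at any time, the circular wait of Figure 5 cannot form in this model.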

B. Scratchpad Sharing 

Figure 2(c) shows an example of scratchpad sharing,
where we consider a GPU that has 35K units of scratchpad
memory per SM, and each thread block requires 10K units.
To increase the number of thread blocks, we allocate 10K units
to each of TB0 and TB1. We allocate the remaining 15K
scratchpad units together to thread blocks TB2 and TB3.
With scratchpad sharing, each of the thread blocks TB2
and TB3 can access 5K units of scratchpad memory privately.
The remaining 5K units of memory are allotted to the first
block that accesses them^2. Similar to the register sharing
approach, we refer to TB0 and TB1 as unshared thread blocks,
and to TB2 and TB3 as shared thread blocks. When a thread from
a shared thread block (say TB2) needs to access a memory
location in the shared scratchpad, it gains exclusive access
by acquiring a lock. Any thread from the other shared thread
block (TB3) can continue to execute instructions until its
first access to a shared scratchpad location, at which point all
threads of TB3 busy-wait until TB2 has finished its execution.

The implementation to support scratchpad sharing in
GPGPU-Sim is shown in Figure 4.

The steps for the shared scratchpad access follow rules
similar to those for the shared register access, and are
omitted for brevity. Note that a deadlock can never occur with
scratchpad sharing.

C. Computing the Number of Thread Blocks to be Launched per SM

A naive method of sharing, where each thread block
shares resources with some other thread block, may launch
more thread blocks than the default (non-sharing)
approach. However, the number of thread blocks that make
progress (effective thread blocks) per SM can be less than
that for non-sharing. For example, consider a scenario where
3 thread blocks are resident per SM without sharing. With
the naive sharing approach, it may be possible to have 4 thread
blocks resident, such that block 1 shares resources with
block 2, and block 3 shares resources with block 4. It can
happen that blocks 2 and 4 start accessing shared resources,
causing blocks 1 and 3 to wait. Effectively, only two thread
blocks (blocks 2 and 4) make progress in the naive
sharing approach, whereas all 3 blocks make progress in
the non-sharing approach, so naive sharing reduces the
throughput.

^2 Unlike register sharing, we cannot distribute 1.5K units of scratchpad
memory to each pair of warps, because any thread within a thread block can
access any scratchpad location allocated for that block.

To avoid this, we describe a method to compute the total number
of thread blocks (Shared + Unshared) to be launched per
SM such that the number of effective thread blocks using
sharing is no less than that of the non-sharing approach. We
use the following notation:

1. $R$: Number of units of the resource available per SM,

2. $R_{tb}$: Number of resource units required by a thread block,

3. $S$: Number of pairs of thread blocks that are to be
launched per SM in shared mode,

4. $U$: Number of thread blocks to be launched in an SM
that do not share resources with any other thread block,

5. $M$: Maximum number of thread blocks to be launched in
an SM,

6. $t$: Threshold for computing the number of resources that
a thread block shares with another thread block. For a
given threshold value $t$ ($0 < t < 1$) we allocate
$(1 + t)R_{tb}$ resource units per two shared thread blocks,
of which $(1 - t)R_{tb}$ resource units are shared units.

Without sharing, we can launch up to $\lfloor R/R_{tb} \rfloor$
thread blocks in an SM, and all of them make progress. In our
approach, if two thread blocks are launched in sharing
mode, at least one of them always makes progress. So,
when S shared pairs are launched in an SM, at least S thread
blocks always make progress. Also, if U unshared thread
blocks are launched in the SM, they always make progress.
Therefore, at least S + U thread blocks always make progress
with our approach. In order to keep the number of effective
thread blocks in our approach the same as that of the
non-sharing approach, we need the following relation to hold:

$S + U = \lfloor R/R_{tb} \rfloor$ (1)

For each shared pair of thread blocks, we allocate
$R_{tb}(1 + t)$ resource units. Similarly, for each unshared
thread block, we allocate $R_{tb}$ resource units. Since the
total number of resource units available in the SM is R, we
have:

$U R_{tb} + S R_{tb}(1 + t) \le R$ (2)

The total number of thread blocks that can be launched in the
sharing approach is equal to the number of unshared thread
blocks plus twice the number of shared pairs, i.e.,

$M = U + 2S$ (3)

M is obtained by combining Equations (1), (2), and (3).
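One way to combine the three relations is sketched below. It is our own derivation, made under the added assumptions that S and U are non-negative integers and that the capacity constraint (2) is non-strict; it is not a quotation of a formula from the text.

```latex
\begin{align*}
U &= \left\lfloor \frac{R}{R_{tb}} \right\rfloor - S
  && \text{from (1)} \\
\left( \left\lfloor \frac{R}{R_{tb}} \right\rfloor - S \right) R_{tb}
  + S R_{tb} (1 + t) &\le R
  && \text{substituting into (2)} \\
S &\le \frac{R - \lfloor R/R_{tb} \rfloor R_{tb}}{t R_{tb}}
   = \frac{R \bmod R_{tb}}{t R_{tb}} \\
M &= U + 2S = \left\lfloor \frac{R}{R_{tb}} \right\rfloor + S
  && \text{from (1) and (3)}
\end{align*}
```

Taking the largest integer S that also keeps U non-negative, the example of Section III (R = 35K, R_tb = 10K, t = 0.5) gives S = 1, U = 2, and M = 4, which matches the four thread blocks of Figure 2.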


Since the actual number of thread blocks that can reside
in an SM also depends on (a) the maximum number of resident
threads per SM, and (b) the maximum number of resident thread
blocks in the SM, the number of thread blocks resident in
an SM in our approach is the minimum of the numbers obtained
using these factors and the value of M.

Figure 6: Warp Scheduling
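The computation of this section can be sketched as a single function. The function name and the choice of the largest S that keeps U non-negative are our assumptions; the relations used are Equations (1) to (3) and the two per-SM limits named above.

```python
from fractions import Fraction
from math import floor


def resident_blocks_with_sharing(R, R_tb, t, threads_per_block,
                                 max_threads_per_sm, max_blocks_per_sm):
    """Thread blocks resident per SM with sharing (sketch, single resource)."""
    t = Fraction(str(t))            # exact arithmetic for thresholds like 0.1
    base = R // R_tb                # floor(R / R_tb): blocks without sharing
    leftover = R - base * R_tb      # R mod R_tb: units wasted without sharing
    # S: largest pair count allowed by Equation (2), bounded so that U >= 0
    pairs = min(base, floor(leftover / (t * R_tb)))
    unshared = base - pairs         # U, from Equation (1)
    m = unshared + 2 * pairs        # M, from Equation (3)
    return min(m, max_threads_per_sm // threads_per_block, max_blocks_per_sm)


# Section III example: 35K units per SM, 10K per block, t = 0.5
print(resident_blocks_with_sharing(35000, 10000, 0.5, 320, 1536, 8))      # 4
# hotspot registers, t = 0.1, limits of 1536 threads and 8 blocks per SM
print(resident_blocks_with_sharing(32768, 36 * 256, 0.1, 256, 1536, 8))   # 6
```

The hotspot result is capped by the thread limit (1536 / 256 = 6) rather than by M.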

IV. Optimizations 

With the proposed resource sharing approach, each SM
has unshared and shared warps, and scheduling these warps
plays a very important role in determining the performance
of applications. We propose an optimization called "Owner
Warp First (OWF)" to schedule these warps effectively. If
two thread blocks $TB_i$ and $TB_j$ are a shared pair, and at
least one of the warps of $TB_i$ waits for shared resources from
$TB_j$, we call $TB_j$ the Owner Block, and the warps that belong
to $TB_j$ are called Owner Warps. $TB_i$ is called the Non-Owner
Block and the warps of $TB_i$ are called Non-Owner Warps. As
soon as the owner thread block finishes its execution, it
transfers its ownership to the non-owner thread block (i.e.,
the non-owner thread block becomes the owner), and a new
non-owner thread block gets launched.

A. Scheduling Owner Warp First (OWF) 

A warp scheduler in the SM issues a warp every cycle
from a pool of ready warps. With our solution, the warps
can be categorized into three types, viz., unshared, shared
owner, and shared non-owner. In register sharing, shared
non-owner warps depend on the corresponding shared owner
warps to release registers before they can make progress.
Similarly, with scratchpad sharing, warps from non-owner
thread blocks wait for owner thread blocks to complete
their execution. In the Owner Warp First algorithm we
prioritize warps in the order: shared owner, unshared, and
shared non-owner. Giving the highest priority to shared owner
warps helps finish them sooner, and hence the dependent shared
non-owner warps can make progress. Since shared non-owner
warps cannot make much progress before stalling,
giving them lower priority than unshared warps helps use
them only to hide stalls when no other types of warps are
ready to run. Figure 6 illustrates that scheduling unshared
warps with high priority compared to shared warps can
degrade the performance of an application.
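The priority order can be sketched as a selection function over the ready warps. The names below are hypothetical, and ties within a class are broken by list order here purely for simplicity; the text does not specify the tie-breaking policy.

```python
from enum import IntEnum


class WarpKind(IntEnum):
    # Smaller value means higher OWF priority.
    SHARED_OWNER = 0
    UNSHARED = 1
    SHARED_NON_OWNER = 2


def pick_warp_owf(ready_warps):
    """ready_warps: list of (warp_id, WarpKind). Returns a warp id, or None on a stall."""
    if not ready_warps:
        return None   # no ready warp: this cycle is a stall cycle
    warp_id, _kind = min(ready_warps, key=lambda w: w[1])
    return warp_id


ready = [(5, WarpKind.UNSHARED), (7, WarpKind.SHARED_NON_OWNER), (9, WarpKind.SHARED_OWNER)]
print(pick_warp_owf(ready))   # 9: the shared owner warp goes first
```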

B. Unrolling and Reordering of Register Declarations 

In register sharing, non-owner warps need to wait for
owner warps when they try to access shared registers. If
the very first instruction issued by a non-owner warp uses a
shared register, then the warp has to wait and cannot start
its execution until the corresponding owner warp has released
the shared register. In order to allow the non-owner warps to
execute as many instructions as possible before stalling due
to unavailability of shared registers, we unroll and reorder
the register declarations. To illustrate this, consider the
PTXPlus code shown in Figure 7(a), which is generated
by GPGPU-Sim [8] for the sgemm application from the Parboil
suite [7]. The first instruction of the code accesses registers
p0 and r124, which get the register sequence numbers
31 and 35 according to the declarations. These registers are
part of the shared registers for a certain threshold value t.
Hence, a non-owner warp has to wait until the registers are
released. To delay accessing the shared registers, we unroll
and rearrange the order of the register declarations so that
p0 and r124 become unshared registers (i.e., they get the
register sequence numbers 1 and 3). This is shown in
Figure 7(b). Hence the non-owner warps get to execute more
instructions before they start accessing shared registers.

.reg .u32 $r<27>;
.reg .u32 $ofs<3>;
.reg .pred $p<4>;
.reg .u32 $r124;
.reg .u32 $o127;

set.le.s32.s32 $p0|$o127,
    s[0x003c], $r124;
mov.u32 $r16, $r124;
mov.u32 $r17, $r124;
mov.u32 $r9, $r124;
mov.u32 $r18, $r124;
mov.u32 $r10, $r124;

/* Code here */

(a) Normal Declarations

.reg .pred $p0;
.reg .u32 $o127;
.reg .u32 $r124;
.reg .u32 $r16;
.reg .u32 $r17;
.reg .u32 $r9;
.reg .u32 $r18;
.reg .u32 $r10;
.reg .u32 $r11;

set.le.s32.s32 $p0|$o127,
    s[0x003c], $r124;
mov.u32 $r16, $r124;
mov.u32 $r17, $r124;
mov.u32 $r9, $r124;
mov.u32 $r18, $r124;
mov.u32 $r10, $r124;

/* Code here */

(b) Unrolled Declarations

Figure 7: Unrolling and Reordering of Register Declarations

To implement this optimization, we converted the
assembly code (PTXPlus) produced by GPGPU-Sim into
optimized assembly code. To achieve this, we first find an
order of registers according to their first usage. Further,
to ensure that the unshared registers are used before the
shared registers, we modify the register declarations so that
a register that is used first in the assembly code is declared
first. Finally, we modified the GPGPU-Sim simulator to use the
optimized PTXPlus code for simulating instructions. This
optimization can be easily integrated at the assembly level
using the existing CUDA compiler.
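The first-use ordering can be sketched as a small text transformation. This is an illustration under stated assumptions, not the authors' tool: the function names are ours, the regular expression assumes register names of the form `$` + letters + digits, and the type rule (`.pred` for `$p` registers, `.u32` otherwise) is a simplification that fits only the fragment shown. Registers that never appear in the instructions are not emitted by this sketch.

```python
import re


def order_by_first_use(instructions):
    """Return register names in order of their first appearance."""
    seen = []
    for line in instructions:
        for reg in re.findall(r"\$[A-Za-z]+\d+", line):
            if reg not in seen:
                seen.append(reg)
    return seen


def unrolled_declarations(instructions):
    """Emit one declaration per register, first-used register first."""
    decls = []
    for reg in order_by_first_use(instructions):
        reg_type = ".pred" if reg.startswith("$p") else ".u32"   # simplifying assumption
        decls.append(f".reg {reg_type} {reg};")
    return decls


code = [
    "set.le.s32.s32 $p0|$o127, s[0x003c], $r124;",
    "mov.u32 $r16, $r124;",
    "mov.u32 $r17, $r124;",
]
for decl in unrolled_declarations(code):
    print(decl)
```

On this fragment the sketch declares `$p0`, `$o127`, `$r124`, `$r16`, and `$r17` in that order, which agrees with the start of Figure 7(b).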

C. Dynamic Warp Execution 

A recent study by Kayiran et al. [1] shows that the
performance of memory bound applications can degrade with
an increase in the number of resident thread blocks. Executing
additional thread blocks can increase L1/L2 cache misses,
which leads to an increase in the number of stall cycles. The
register sharing approach exploits more thread level
parallelism from the additional warps (non-owner warps) than
scratchpad sharing, because register sharing allows a
non-owner warp to start executing as soon as its corresponding
owner warp finishes. So, in order to reduce the number
of stalls caused by the execution of non-owner warps, we
propose an optimization that can dynamically enable or
disable the execution of long latency (memory) instructions
issued by the additional blocks.

To control the execution of memory instructions from the
non-owner warps, we monitor the number of stall cycles
for each SM. When executing memory instructions from
non-owner warps leads to an increase in the number of stalls,
we decrease the probability of executing further memory
instructions from the non-owner warps. To illustrate this,
consider a GPU that has N SMs, all in sharing mode.
Our approach disables execution of memory instructions
for the non-owner warps only on a specific SM (e.g.,
$SM_0$). Every other SM, $SM_i$ for $i \in \{1, \ldots, N-1\}$,
allows execution of memory instructions for the non-owner
warps, and compares its stall cycles periodically with the
stalls on $SM_0$. If the stalls observed in $SM_i$ are more than
the stalls in $SM_0$, then the probability of executing memory
instructions on $SM_i$ from the non-owner warps is decreased
by a predetermined value p. If the stalls in $SM_i$ are fewer
than those in $SM_0$, then the probability of executing memory
instructions on $SM_i$ from the non-owner warps is increased
by the same value p. Thus, we reduce the number of stall
cycles by controlling the execution of memory instructions.

After running several experiments, we selected the
periodicity of monitoring to be 1000 cycles, which ensures that
(a) the monitoring overhead is not high, and (b) a sufficient
number of stall cycles is observed. In our experiments,
initially all the SMs are allowed to execute all memory
instructions, i.e., the probability of executing memory
instructions from non-owner warps is 1. Depending on the
stall cycles observed for an $SM_i$ ($i \in \{1, \ldots, N-1\}$),
this probability for $SM_i$ is decreased or increased by
p = 0.1, but is kept within the interval [0, 1] as a
saturating counter.
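The periodic adjustment can be sketched with an integer saturating counter, where level k stands for probability k/10 (matching p = 0.1). The names `update_level` and `may_issue_memory` are hypothetical, and drawing a random number per memory instruction is our assumption about how the probability is applied.

```python
import random

LEVELS = 10   # p = 0.1, so probabilities are 0/10, 1/10, ..., 10/10


def update_level(level, stalls_sm_i, stalls_sm_0):
    """Adjust the level of SM_i once per monitoring period (e.g. 1000 cycles)."""
    if stalls_sm_i > stalls_sm_0:
        level -= 1          # more stalls than the reference SM: throttle
    elif stalls_sm_i < stalls_sm_0:
        level += 1          # fewer stalls: allow more memory instructions
    return max(0, min(LEVELS, level))   # saturate within [0, 1]


def may_issue_memory(level, rng):
    """Decide whether a non-owner warp may issue a memory instruction."""
    return rng.random() < level / LEVELS


level = LEVELS                           # initial probability is 1
level = update_level(level, 1200, 1000)  # SM_i stalled more than SM_0
print(level / LEVELS)                    # 0.9
print(may_issue_memory(level, random.Random(0)))
```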

V. Hardware Requirement 

For both register and scratchpad sharing: (1) Each SM
requires a bit to specify whether sharing mode is enabled
for it. This bit is set when the number of thread
blocks assigned to the SM using resource sharing is more
than the default number of thread blocks per SM. (2) Each
thread block stores the id of its partner thread block (set to
-1 if the thread block is in unsharing mode). For T thread
blocks, $T \lceil \log_2(T+1) \rceil$ bits are required per SM
(assuming ids 0 to T-1 for the T thread blocks, we can use
id T to represent -1). (3) Each warp requires a bit for
specifying the owner information. This bit is set only when
the warp is an owner warp. Hence for W warps, W bits are
needed.

For register sharing: (1) Each warp requires a bit to
specify whether it is in sharing or unsharing mode. Hence
for W warps in an SM, W bits are required. For a warp in
shared mode, its corresponding shared warp can be identified
using the sharer thread block id of its thread block and its
relative position in the thread block. (2) Each pair of shared
warps uses a lock variable to access the shared registers
exclusively. The lock variable is set to the id of the warp
which has gained access to the shared registers. If an SM has
W warps, there can be a maximum of $\lfloor W/2 \rfloor$ shared
pairs of warps in the SM. Hence we need a total of
$\lfloor W/2 \rfloor \lceil \log_2 W \rceil$ bits per SM.

For scratchpad sharing: (1) Each pair of shared thread
blocks uses a lock variable to access the shared locations
exclusively. The lock variable is set to the id of the thread
block which has gained access to the shared scratchpad
region. If an SM has T thread blocks, there can be a
maximum of $\lfloor T/2 \rfloor$ shared pairs of thread blocks
in the SM. Hence we need a total of
$\lfloor T/2 \rfloor \lceil \log_2 T \rceil$ bits per SM.

Hence the total amount of storage required (in bits) for a
GPU with N SMs for implementing register sharing is:

$(1 + T \lceil \log_2(T+1) \rceil + 2W + \lfloor W/2 \rfloor \lceil \log_2 W \rceil) \times N$

and for implementing scratchpad sharing is:

$(1 + T \lceil \log_2(T+1) \rceil + W + \lfloor T/2 \rfloor \lceil \log_2 T \rceil) \times N$

We also require two comparator circuits to implement the
resource accesses (Figures 3 and 4, steps (b) and (c)).
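The two storage totals can be evaluated numerically. The sketch below uses our own function names and, as an illustrative assumption, the per-SM limits of Table I (T = 8 thread blocks, W = 1536 / 32 = 48 warps) with N = 1 to obtain a per-SM figure.

```python
def ceil_log2(n):
    """ceil(log2(n)) for n >= 1, computed exactly with integers."""
    return (n - 1).bit_length()


def register_sharing_bits(T, W, N):
    per_sm = 1 + T * ceil_log2(T + 1) + 2 * W + (W // 2) * ceil_log2(W)
    return per_sm * N


def scratchpad_sharing_bits(T, W, N):
    per_sm = 1 + T * ceil_log2(T + 1) + W + (T // 2) * ceil_log2(T)
    return per_sm * N


print(register_sharing_bits(8, 48, 1))     # 273 bits per SM
print(scratchpad_sharing_bits(8, 48, 1))   # 93 bits per SM
```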

VI. Experimental Analysis 
A. Evaluation Methodology 

We implemented our approach using GPGPU-Sim V3.X
[8]. The baseline architecture used for comparing our
approach is shown in Table I. We experimentally evaluated
our approach on several applications from the GPGPU-Sim [8],
Rodinia [9], CUDA-SDK [10], and Parboil [7] benchmarks.
Depending on the resource requirement of applications, we
divided the benchmarks into three sets. Set-1, shown in Table
II, consists of applications whose number of thread blocks
per SM is limited by registers. Similarly, Set-2, shown
in Table III, has applications that are limited by scratchpad
memory. Finally, Set-3, shown in Table IV, consists
of applications whose number of thread blocks per SM is
limited by factors other than registers, such as (a) the
maximum number of resident threads or (b) the maximum number
of resident thread blocks. For each application in Tables II,
III, and IV, we show the names of the kernels used for
evaluation and the number of threads per thread block. In
Table II, we report the number of registers per thread for
each kernel, which GPGPU-Sim uses to compute the number of
resident thread blocks, and in Table III we show the amount of
scratchpad memory used by each thread block. We compiled all
the applications using CUDA 4.2 and executed them on the
inputs provided in the benchmark suites.

We use the value of the threshold (t) to configure the
percentage of resource sharing. For example, if each thread
block requires $R_{tb}$ units of a resource, and we choose
t = 0.1, then we allocate $1.1 \times R_{tb}$ resource units
per two shared thread blocks, which means 90% of the resource
units ($R_{tb}$) are used as shared resource units. So for a
given threshold t, we can compute the percentage of register
sharing as $(1 - t) \times 100$. For all our experimental
results, we use the threshold value 0.1 (i.e., 90% resource
sharing), unless otherwise specified.

Table I: GPGPU-Sim Architecture

| Resource | GPU Configuration |
| Number of Clusters | [illegible] |
| Number of Cores/Cluster | 1 |
| Max Number of Thread Blocks/Core | 8 |
| Max Number of Threads/Core | 1536 |
| Number of Registers/Core | 32768 |
| Scratchpad Memory/Core | 16KB |
| Warp Scheduling | LRR |
| Number of Schedulers | 2 |
| L1-Cache/Core | 16KB |
| L2-Cache | 768KB |
| DRAM Scheduler | FR-FCFS |
| GDDR3 Timings | tRRD = 6, tWR = 12, tRCD = 12, tRAS = [illegible], tRP = 12, tRC = 40, tCL = 12, tCDLR = 5 |

Table II: Set-1: Benchmarks that are limited by registers

| Benchmark | Application | Kernel | Block Size | Registers per thread |
| GPGPU-Sim | backprop | bpnn_adjust_weights_cuda | 256 | [illegible] |
| GPGPU-Sim | b+tree | findRangeK | 508 | 24 |
| RODINIA | hotspot | calculate_temp | 256 | 36 |
| RODINIA | LIB | Pathcalc_Portfolio_KernelGPU | 192 | 36 |
| RODINIA | MUM | mummergpuKernel | 256 | 28 |
| PARBOIL | mri-q | ComputeQ_GPU | 256 | 24 |
| PARBOIL | sgemm | mysgemmNT | 128 | 48 |
| PARBOIL | stencil | block2D_hybrid_coarsen_x | 512 | 28 |

Table III: Set-2: Benchmarks that are limited by scratchpad memory

| Benchmark | Application | Kernel | Block Size | Scratchpad Size (bytes) |
| CUDA-SDK | convolutionSeparable (CONV1) | convolutionRowsKernel | 64 | 2560 |
| CUDA-SDK | convolutionSeparable (CONV2) | convolutionColumnsKernel | 128 | 5184 |
| RODINIA | lavaMD | kernel_gpu_cuda | 128 | 7200 |
| RODINIA | nw (NW1) | needle_cuda_shared_1 | 16 | 2180 |
| RODINIA | nw (NW2) | needle_cuda_shared_2 | 16 | 2180 |
| RODINIA | srad_v2 (SRAD1) | srad_cuda_1 | 256 | 6144 |
| RODINIA | srad_v2 (SRAD2) | srad_cuda_2 | 256 | 5120 |

Table IV: Set-3: Benchmarks that are limited by threads or blocks

B. Experimental Results 

We measured the performance of our approach using the 
number of Instructions executed Per Cycle (IPC), number 
of stall cycles, and number of idle cycles and compared it 
with the baseline GPGPU-Sim |0 implementation. 

We first show that resource sharing helps in increasing the 
number of thread blocks launched for the applications in Set- 
1 and Set-2. In FigureOa), we compare the effective number 
of thread blocks launched by register sharing approach 
(denoted as Shared-OWF-Unroll-Dyn) with that of baseline 
implementation (denoted as Unshared-LRR). For applica¬ 
tions MUM, backprop, hotspot, and mri-q our approach is 
able to launch 6 thread blocks (i.e, 1536 threads), which 
is the maximum limit on the number of resident threads 
per SM. Applications stencil and b+tree launch 3 thread 
blocks per SM, compared to 2 in the baseline approach. For 
applications LIB and sgemm our approach is able to launch 
8 thread blocks per SM, which is the maximum limit on the 
number of resident thread blocks. 

Similarly in Figure [Sfb), we compare the number of resi¬ 
dent thread blocks launched by scratchpad sharing (labeled 
as Shared-OWF) with baseline approach. CONVl, NWl, and 
NW2 launch upto 8 thread blocks per SM, which is the 
maximum limit on the number of resident thread blocks. 

Figure Oc) shows the improvement in IPC with register 
sharing over the baseline LRR (Loose Round Robin) im¬ 
plementation. We observe that applications b+tree, hotspot, 
MUM, and stencil achieve significant speedups of 11.98%, 


Benchmark 

Application Kernel 

Limited by 

RODINIA 

backprop 

bpnn layertbrward CUDA Threads 

GPGPU-Sim BPS 

Kernel 

Threads 

RODINIA 

gaussian 

FAN2 

Blocks 

GPGPU-Sim NN 

executeSecondLayer 

Blocks 


21.76%, 24.14%, and 23.45% respectively. Similarly Fig¬ 
ure Od) shows the performance improvement in IPC with 
scratchpad sharing. CONVl, lavaMD, and SRAD2 achieve 
significant speedups of 15.85%, 29.96%, and 25.73% re¬ 
spectively. These applications leverage all our optimizations 
to perform better. The performance improvement in IPC for 
lavaMD is due to two reasons (1) The number of resident 
thread blocks launched by our approach is twice that of base¬ 
line implementation (2) No instruction that uses scratchpad 
memory location falls into shared scratchpad, hence all the 
additional thread blocks execute instructions without waiting 
for shared thread blocks. Though LIB launches 8 thread 
blocks per SM when register sharing is enabled, it improves 
only by 0.84%. It is due to increase in L2 cache misses 
that are caused by additional shared blocks. The bench¬ 
marks backprop and sgemm achieve modest improvements 
of 5.82% and 4.06% respectively with register sharing. 
Similarly, CONVl, NWl, and NW2 show improvements 
of 4.33%, 5.62%, and 9.03% respectively with scratchpad 
sharing, mri-q slows down by 0.72%, because additional 
shared blocks increase LI cache misses and hence increase 
the number of stalls. SRAD2 shows improvement only upto 
0 .1%, because it has a barrier instruction (placed next to an 
instruction that accesses shared scratchpad memory) which 
limits the progress of shared threads that do not access any 
shared scratchpad location. 
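
The threshold arithmetic and its effect on the number of
resident thread blocks can be illustrated with a small model.
This is only a sketch under stated assumptions, not the
allocation logic implemented in the simulator: the function
names are hypothetical, a pair of sharing blocks is assumed to
occupy (1 + t) * Rtb units, any leftover space is assumed to be
usable by unshared blocks, and the block and thread limits are
those of Table I. Intermediate thresholds may behave differently
in the actual implementation.

```python
from fractions import Fraction


def sharing_percent(t):
    """Percentage of a block's resource that is shared: (1 - t) * 100."""
    return (1 - Fraction(str(t))) * 100


def units_per_shared_pair(r_tb, t):
    """Resource units allocated to two sharing blocks: (1 + t) * Rtb."""
    return (1 + Fraction(str(t))) * r_tb


def blocks_with_sharing(total, r_tb, t, block_size,
                        max_blocks=8, max_threads=1536):
    """Estimate resident blocks per SM when pairs of blocks may share.

    Tries every feasible number of sharing pairs, fills the remaining
    space with unshared blocks, and applies the block/thread limits.
    """
    pair = units_per_shared_pair(r_tb, t)
    best = 0
    for pairs in range(int(total // pair) + 1):
        unshared = int((total - pairs * pair) // r_tb)
        best = max(best, 2 * pairs + unshared)
    return min(best, max_blocks, max_threads // block_size)


print(sharing_percent(0.1))                              # 90
# Register sharing with t = 0.1 and 32768 registers per SM:
print(blocks_with_sharing(32768, 256 * 36, 0.1, 256))    # hotspot: 6
print(blocks_with_sharing(32768, 192 * 36, 0.1, 192))    # LIB: 8
print(blocks_with_sharing(32768, 512 * 28, 0.1, 512))    # stencil: 3
```

For t = 0.1 this model gives 6 blocks for hotspot (bounded by
the thread limit), 8 for LIB (bounded by the block limit), and
3 for stencil, in line with the counts quoted for Figure 8(a).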

In Figure 9(a), we show the effectiveness of our proposed
optimizations for register sharing by comparing them with
the baseline approach. First, we compare the results of the
register sharing approach when we do not use any
optimization and use the existing baseline LRR scheduling
policy (labeled Shared-LRR-NoOpt). Consider the application
hotspot: it achieves a speedup of 13.65% even without using
any optimization, because the additional thread blocks
launched by our approach help in hiding execution latencies.
With the register unrolling optimization (labeled
Shared-LRR-Unroll), we see a further improvement, up to
15.18%, because register unrolling helps to increase the
usage of unshared registers before the warps start using
shared registers. Hence the application can execute more
instructions before it accesses shared registers. When we
enable the dynamic warp execution (labeled
Shared-LRR-Unroll-Dyn), we see an improvement of only up to
14.58%, because it limits the execution of memory
instructions from non-owner warps. However, when we apply
the OWF optimization (labeled Shared-OWF-Unroll-Dyn), the
speedup increases further, up to 21.76%. In the presence of
the OWF optimization, the priority of non-owner warps
decreases compared to the other warps. Hence the memory
instructions issued by non-owner warps do not interfere with
the other warps, which minimizes the L1/L2 cache misses. We
can see that b+tree behaves similarly to hotspot in terms of
the performance gain obtained by varying the optimizations.

Figure 8: Comparing the number of resident thread blocks with baseline implementation (a) Registers (c) Scratchpad Memory.
Performance improvement in IPC when compared with baseline implementation (b) for Registers (d) for Scratchpad Memory

Figure 9: Performance analysis of optimizations for (a) Register Sharing (c) Scratchpad sharing percentage decrease in stalls and idle cycles for (a) [...]
Legend, panel (a): Shared-LRR-NoOpt, Shared-LRR-Unroll, Shared-LRR-Unroll-Dyn, Shared-OWF-Unroll-Dyn. Legend, panel (b): Shared-LRR-NoOpt, Shared-OWF. Remaining panels: Stall Cycles, Idle Cycles.
[...] shared-OWF approach. It shows 49.5% decrease in idle cycles.

[...] Performance comparison of (a) Scratchpad sharing (c) Register sharing with 2-Level (baseline) scheduler

For application MUM, when we do not use any optimization,
there is a slowdown of 0.15%. We observe that the increase
in resident thread blocks leads to an increase in the L1 and
L2 cache misses that arise from issuing memory instructions
from the non-owner warps. Though we see an increase in the
L1/L2 cache misses, the other instructions issued by the
non-owner warps help in minimizing the stall cycles. With
the register unrolling optimization, we see a slight
improvement (0.08%). When we apply the dynamic warp
execution, it shows a speedup of 6.45%. From this we infer
that dynamic warp execution reduces the additional stall
cycles produced by issuing memory instructions from the
non-owner warps. Further, with the OWF optimization, the
performance improvement goes up to 24.14%, because of the
decrease in interference from the non-owner warps.

LIB shows an improvement of 2% in the presence of sharing
with no optimizations. We observe the same performance even
with the unrolling optimization, because the number of
instructions that use unshared registers before the first
instruction using shared registers is exactly the same as
without the optimization. With dynamic warp execution, we
still observe the same results, since in this application
all the owner warps have completed executing all their
instructions before any non-owner warp starts issuing memory
instructions. With the OWF optimization, we observe a small
degradation because of an increase in the number of stall
cycles compared to the LRR scheduling policy.






















































The benchmarks sgemm, backprop, and stencil achieve good
improvements only when the OWF optimization is enabled.
Since instructions issued by non-owner warps execute with
the least priority, they do not interfere with other warps
and hence minimize L1/L2 cache misses. We do not see any
performance improvement with mri-q, because the additional
thread blocks increase L1 cache misses with our approach.
However, the slowdown was reduced to 0.72% in the presence
of all the optimizations.

To summarize, memory-bound applications, like MUM, take
advantage of our proposed sharing approach in the presence
of the dynamic warp execution and OWF optimizations, whereas
compute-bound applications, like hotspot, perform better
even without any optimizations and improve further with the
OWF optimization.

In Figure 9(b), we show the effect of the OWF optimization
on scratchpad sharing. lavaMD shows an improvement of 28%
even without any optimization (labeled Shared-LRR-NoOpt).
This is because the additional thread blocks do not access
any memory location that belongs to the shared scratchpad
memory. CONV1, CONV2, SRAD1, and SRAD2 show improvements of
5.68%, 6.21%, 11.1%, and 5.28% respectively without applying
any optimization, which is due to the additional thread
blocks that help in hiding the latencies.

With the OWF optimization, CONV2, NW1, NW2, and SRAD1
improve up to 15.85%, 5.62%, 9.03%, and 25.73% respectively.
Since the OWF optimization schedules the owner warps
efficiently, it helps in minimizing stall cycles, thus
improving the IPC. lavaMD improves up to 30%, since it
benefits more from sharing than from the OWF optimization.
CONV1 and SRAD2 perform better when no optimization is
applied, because these applications observe more L1 and L2
misses and more stall cycles with the OWF optimization than
without it.

In Figures 9(c) and (d), we report the percentage decrease
in the number of idle cycles (cycles in which all the
available warps are issued, but no warp is ready to execute)
and stall cycles (pipeline stalls) compared to the baseline
implementation. We observe that all applications show a
reduction in the number of idle cycles (up to 99%). This is
expected because, with the increase in the number of thread
blocks, the number of instructions that are ready to execute
also increases. For applications MUM, LIB, backprop,
hotspot, and stencil, the number of stall cycles also
reduces with our register sharing approach. Similarly, for
CONV2, NW1, NW2, SRAD1, and SRAD2, the number of stall
cycles reduces with scratchpad sharing. This indicates that
the additional thread blocks launched with our approach hide
the long execution latencies in a better way. We observe an
increase in the stall cycles for applications b+tree and
stencil. However, since the number of idle cycles has
significantly reduced, overall we see a benefit with our
approach. For mri-q, the number of stall cycles increases
with our approach due to the increase in the number of L1
cache misses. lavaMD shows an increase in the number of
stall cycles by 259; this is because the additional threads
launched by our approach are waiting for execution units (SP
units) to become ready. Also with CONV1, we see an increase
in the number of stalls, since our approach has more L1
cache misses compared to the baseline approach.

Figure 11: Comparison with LRR that uses twice the number of (a) Registers (b) Scratchpad

In Figures 10(a) and (b), we show the performance
improvement of the register and scratchpad sharing
approaches over the GTO (Greedy Then Oldest) scheduler
respectively. We observe that our approach shows an
improvement of up to 3.9% with register sharing and up to
30% with scratchpad sharing. Further, as shown in
Figures 10(c) and (d), we observe an improvement of up to
27.22% in IPC with register sharing, and up to 27.08% with
scratchpad sharing, over the two-level scheduling policy.

We also measure the effectiveness of the resource sharing
mechanism by comparing it with an LRR scheduler that uses
twice the number of resources. In Figure 11(a), the baseline
approach (labeled Unshared-LRR-Reg#65536) uses 64K
registers, whereas our approach uses only 32K registers.
Even with an increase in the number of registers, and hence
an increase in the number of resident thread blocks in the
baseline approach, our approach performs better in 5 out of
8 applications. For example, application MUM performs better
with our approach, though we see the same number of thread
blocks (6) in both approaches, as the dynamic warp execution
optimization helps in minimizing the stalls produced by the
additional thread blocks. Applications sgemm, b+tree, and
LIB perform better with the baseline approach due to an
increase in the number of resident thread blocks and hence
an increase in the number of active warps. In Figure 11(b),
we compare the scratchpad sharing approach that uses 16K
bytes of memory with the baseline approach that uses 32K
bytes of memory. From the figure we observe that the
performance of CONV1, NW1, and NW2 is comparable to that of
the baseline approach, because our approach can launch the
same number of resident thread blocks as the baseline
approach. lavaMD performs better than the baseline approach,
because sharing helps in minimizing latencies and the OWF
optimization helps in scheduling the warps efficiently.
CONV2, SRAD1, and SRAD2 degrade with our approach, because
(1) the number of resident thread blocks in our approach is
less than that of the baseline approach, and (2) the number
of stall cycles observed in our approach is more than that
of the baseline implementation.
















Table V: Effect on IPC with register sharing

% Sharing  0%      10%     30%     50%     70%     90%
backprop   389.9   389.9   389.9   389.9   394.1   392.8
b+tree     318.5   318.5   318.5   323.3   326.1   326.1
hotspot    489.5   489.5   489.5   475.2   476.9   503.59
LIB        218.0   218.0   203.0   203.0   216.3   223.3
MUM        190.5   190.5   190.5   192.1   192.4   194.9
mri-q      303.7   303.7   303.7   303.7   305.3   305.0
sgemm      490.6   490.6   490.6   490.6   446.3   496.7
stencil    448.2   448.2   448.2   448.2   448.2   440.8

Table VI: Effect on resident thread blocks with register sharing

% Sharing  0%  10%  30%  50%  70%  90%
backprop   5   5    5    5    6    6
b+tree     2   2    2    3    3    3
hotspot    3   3    3    4    4    6
LIB        4   4    5    5    6    8
MUM        4   4    4    5    5    6
mri-q      5   5    5    5    6    6
sgemm      5   5    5    5    6    8
stencil    2   2    2    2    2    3

Table VII: Effect on IPC with scratchpad sharing

% Sharing  0%      10%     30%     50%     70%     90%
CONV1      280.33  280.33  280.33  280.33  288.82  292.24
CONV2      119.29  119.29  119.29  119.29  119.02  124.6
lavaMD     452.29  452.29  452.29  452.29  452.29  578.85
NW1        39.96   39.96   39.96   38.67   38.37   38.37
NW2        41.93   41.93   41.93   42.14   40.54   39.72
SRAD1      188.13  188.13  188.13  229.38  208.27  204.32
SRAD2      63.48   63.48   63.48   63.52   63.62   68.29

Table VIII: Effect on resident thread blocks with scratchpad sharing

[...]      2 2 [...]
[...]      7 7 [...]
[...]      2 2 2 3 [...] 4
SRAD2      3 3 3 3 3 [...]

1) Effect of sharing on performance: In Table V and
Table VII, we analyze the performance of the resource
sharing approach with the amount of sharing. From the
results we observe that most of the applications perform
better when the amount of sharing is 90%. This is because,
as shown in Tables VI and VIII, with an increase in the
amount of resource sharing, the number of resident thread
blocks increases. These resident thread blocks help in
hiding long latencies and hence help in achieving high
throughput. From Tables V and VII we also notice that all
applications behave the same at 0% and 10% sharing. At these
percentages of sharing, the number of resident thread blocks
with our approach is the same as that of the baseline
implementation. Hence, at run time, our approach decides to
launch all the thread blocks in the unsharing mode. Since
all these blocks are in unsharing mode, all the warps become
unshared warps. In this case, the OWF optimization uses
dynamic warp ids to schedule warps and achieves higher
performance than the baseline approach. SRAD1 (Table VII)
performs better at 50%, because at this level of sharing the
number of instructions that get executed before the warps
enter the shared scratchpad memory region is more than that
at 90%. Also, at 50% sharing these extra instructions belong
to loop statements, hence we observe higher IPC values.

2) Performance analysis of Set-3 benchmarks: The performance
of the register sharing and scratchpad sharing approaches
for the Set-3 applications (Table IV) is presented in
Figures 12(c) and (d) respectively. As discussed earlier,
these applications are not limited by the number of
available resources but by other factors such as the number
of threads, thread blocks, etc. We measure their performance
when our approach uses (1) the LRR scheduling policy, (2)
the GTO scheduling policy, and (3) the OWF scheduling
policy*. From Figures 12(c) and (d), we observe that our
proposed resource sharing approach, when used with LRR
scheduling (labeled as Shared-LRR-Unroll-Dyn), performs
exactly the same as the baseline LRR scheduling
(Unshared-LRR). Since the number of thread blocks launched
by these applications is not limited by the resources, our
approach does not launch any additional thread blocks, and
all the thread blocks are in unsharing mode. Hence, it
behaves exactly like the baseline approach. Similarly, our
approach, when used with the GTO scheduling policy
(Shared-GTO-Unroll-Dyn), performs exactly the same as the
baseline approach that uses the GTO scheduling policy
without sharing (Unshared-GTO). Finally, we observe that
with the OWF scheduling policy (shown as
Shared-OWF-Unroll-Dyn), our approach is comparable to the
Unshared-GTO implementation. In the OWF optimization, the
warps are arranged according to the priority of owner,
unshared, and non-owner warps. Since in this case we do not
launch any additional thread blocks, all the thread blocks
are in unshared mode. Hence all the unshared warps are
sorted according to their dynamic warp id. So the
performance of Shared-OWF-Unroll-Dyn is similar to that of
the Unshared-GTO implementation.

* We do not use the two-level scheduling policy, because it
cannot be directly integrated with our sharing approach.
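
The warp ordering used by the OWF optimization, as described
above, can be sketched as a two-level sort. This is an
illustrative model rather than the scheduler's implementation:
the Warp class, the numeric ranks, and the function name
owf_order are hypothetical, and it is assumed that a smaller
dynamic warp id denotes an older warp and that the id also
breaks ties among owner and non-owner warps.

```python
from dataclasses import dataclass

# Priority classes named in the text; the numeric ranks are an assumption.
OWNER, UNSHARED, NON_OWNER = 0, 1, 2


@dataclass
class Warp:
    dynamic_id: int  # assumed: smaller id means the warp was launched earlier
    kind: int        # OWNER, UNSHARED, or NON_OWNER


def owf_order(warps):
    """Order warps by class (owner, unshared, non-owner), then by id."""
    return sorted(warps, key=lambda w: (w.kind, w.dynamic_id))


mixed = [Warp(3, NON_OWNER), Warp(2, UNSHARED),
         Warp(1, OWNER), Warp(0, UNSHARED)]
print([w.dynamic_id for w in owf_order(mixed)])      # [1, 0, 2, 3]

# With no sharing, every warp is unshared and only the id matters.
unshared = [Warp(2, UNSHARED), Warp(0, UNSHARED), Warp(1, UNSHARED)]
print([w.dynamic_id for w in owf_order(unshared)])   # [0, 1, 2]
```

The second call shows the Set-3 situation: when all thread
blocks are in unshared mode, the ordering reduces to sorting by
dynamic warp id alone.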

VII. Related Work

Xiang et al. [?] discussed thread block level resource
management. They proposed a hardware solution to launch a
partial thread block when there are not enough resources to
launch a full thread block. Unlike our approach, their
solution can have only one partial thread block running. A
patented register management scheme [?] uses the concept of
virtual registers, which are more numerous than the actual
physical registers, and hence can launch more thread blocks
than allowed by the physical registers. This mechanism can
be combined with our proposed solution. Yang et al. [?]
propose hardware and software solutions to the problem
caused by allocation and deallocation of shared memory at
the thread block granularity.

Warped Register File [?] describes a solution to reduce the
power consumption in the register file by turning off
unallocated registers. Gebhart et al. [?] proposed a unified
memory for registers, scratchpad, and primary cache, which
partitions the resources of an SM as per the application's
needs. It requires many hardware changes to access the
unified storage; in contrast, our approach requires fewer
modifications to the hardware. Other works [?] propose
hardware and software solutions to improve the throughput of
GPUs by handling branch and thread divergence, but these are
orthogonal to our approach.

Figure 12: Performance analysis of Set-3 applications for (a) Register [...]

Other techniques to improve GPU performance include reducing
cache contention, improving DRAM bandwidth, hiding long
latencies, reducing energy consumption, etc. Rogers et al.
[19] propose a cache-conscious wavefront scheduling
algorithm which makes use of an intra-wavefront locality
detector, focusing on the shared L1 cache. A two-level warp
scheduler [20] proposed by Narasiman et al. divides warps
into groups and schedules the warps in each group in a
round-robin manner to hide long latencies in a better way.
Gebhart et al. [?] proposed energy-efficient hierarchical
register file storage and a two-level warp scheduler for
high-throughput processors. OWL [?] proposes various
techniques to reduce cache contention and improve DRAM
bank-level parallelism.


VIII. Conclusions and Future Work 

In this paper we proposed a technique that shares the
resources of an SM to effectively utilize the wasted
resources by launching additional thread blocks in each SM.
For effective utilization of these additional thread blocks,
we proposed optimizations which further help in reducing the
stalls produced in the system. We validated our approach for
register sharing and for scratchpad sharing on several
applications, and showed a maximum improvement of 24% and an
average improvement of 11% with register sharing, and a
maximum improvement of 30% and an average improvement of
12.5% with scratchpad sharing.

In future, we plan to incorporate traditional compiler 
analysis and optimizations into our approach. For example, 
live range analysis along with instruction reordering can be 
used to detect and release registers that are not used beyond 
a point. Such registers, if shared, can be used by the warp in 
the other thread block waiting for shared registers. We also 
plan to study the effect of various cache replacement policies 
on register sharing and use it to improve the throughput of 
memory bound applications. 
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